DATS 6101 Team 1 - Project 1
Introduction
Our research topic aims to perform explorartory data analysis on the Nearest Earth Objects (NEO) dataset recorded by NASA. This particular dataset includes thousands of labeled observations, over the last 22 years, across multiple attributes that each aim to describe a specific object. Borrowing from the humor of the recent film Don’t Look Up by Adam Mckay and NASA’s success in the Double Asteroid Redirection Test (DART) mission, we thought that further describing and exploring this dataset would be opportune.
The EDA performed can help with the early detection of an asteroid. With the early detection, NASA can take quick remedies to help eliminate the threat. NASA’s DART mission (Double Asteroid Redirection Test) was done for the same purpose. It was designed to deflect an asteroid with a certain momentum by hitting it head-on or attempting to slow it down. More can be read online from this article.
According to NASA’s official documentation, a PHA is technically termed using parameters that measure the asteroid’s characteristics and then make a decision if the asteroid is considered hazardous based on certain thresholds. Specifically, all asteroids with an Earth Minimum Orbit intersection Distance (MOID) of 0.05 au or less and an absolute magnitude (H) of 22.0 or less are considered Potentially Hazardous Asteroids (PHAs). In other words, asteroids that can’t get any closer to the Earth (i.e., MOID) than 0.05 au (roughly 7,480,000km) or are smaller than about 140m in diameter (i.e., H=22.0) are not considered PHAs.
The current dataset that we have contains 90,000 rows. Each row represents a NEO identified by an ID and name. Out of these 90,000 rows, 9,000 of them are hazardous and 81,000 of them are non-hazardous. This dataset does not contain null values. Apart from the hazardous target variable, our dataset contains 5 numerical variables that can help identify if the asteroid is hazardous or not. These 5 variables are:
- est_diameter_min (km) – estimated minimum diameter of the NEO
- est_diameter_max (km) – estimated maximum diameter of the NEO
- relative_velocity – the velocity (in km/seconds) with which the NEO was travelling with respect to Earth
- miss_distance – distance (in km) by which the asteroid missed the Earth’s surface
- absolute_magnitude (H) – signifies the brightness of the asteroid
Out of these 5 variables, we have 3 variables that play a primary role in determining if the asteroid was harmful or not. Those are:
- relative_velocity
- miss_distance
- absolute_magnitude
Throughout the rest of this RMD file, we go over each of these 3 variables in complete depth to answer our SMART question – what are some of the statistical characteristics of Nearest Earth Objects, over the last 22 years, in terms of relative velocity, miss distance and absolute magnitude, that make them hazardous or not?
Let’s explore the dataset
Displaying the structure of the data:
## 'data.frame': 90836 obs. of 10 variables:
## $ id : int 2162635 2277475 2512244 3596030 3667127 54138696 54189957 54230078 2088213 3766065 ...
## $ name : chr "162635 (2000 SS164)" "277475 (2005 WK4)" "512244 (2015 YE18)" "(2012 BV13)" ...
## $ est_diameter_min : num 1.1983 0.2658 0.722 0.0965 0.255 ...
## $ est_diameter_max : num 2.679 0.594 1.615 0.216 0.57 ...
## $ relative_velocity : num 13569 73589 114259 24764 42738 ...
## $ miss_distance : num 54839744 61438127 49798725 25434973 46275567 ...
## $ orbiting_body : chr "Earth" "Earth" "Earth" "Earth" ...
## $ sentry_object : chr "False" "False" "False" "False" ...
## $ absolute_magnitude: num 16.7 20 17.8 22.2 20.1 ...
## $ hazardous : chr "False" "True" "False" "False" ...
## NULL
Displaying the first 6 rows of the dataset:
## id name est_diameter_min est_diameter_max
## 1 2162635 162635 (2000 SS164) 1.1983 2.6794
## 2 2277475 277475 (2005 WK4) 0.2658 0.5943
## 3 2512244 512244 (2015 YE18) 0.7220 1.6145
## 4 3596030 (2012 BV13) 0.0965 0.2158
## 5 3667127 (2014 GE35) 0.2550 0.5702
## 6 54138696 (2021 GY23) 0.0364 0.0813
## relative_velocity miss_distance orbiting_body sentry_object
## 1 13569 54839744 Earth False
## 2 73589 61438127 Earth False
## 3 114259 49798725 Earth False
## 4 24764 25434973 Earth False
## 5 42738 46275567 Earth False
## 6 34298 40585691 Earth False
## absolute_magnitude hazardous
## 1 16.7 False
## 2 20.0 True
## 3 17.8 False
## 4 22.2 False
## 5 20.1 True
## 6 24.3 False
The number of NA values in the dataset => 0
Now, we split the dataset into 2 dataframes - one for hazardous and other for non-hazardous. This is to ease out the tests performed.
The dimensions of hazardous asteroids dataframe: 8840, 10
The dimensions of non_hazardous asteroids dataframe: 81996, 10
Performing measures of central tendency using Summary
Summary of all asteroids is as follows:
## id name est_diameter_min est_diameter_max
## Min. : 2000433 Length:90836 Min. : 0.0 Min. : 0.0
## 1st Qu.: 3448110 Class :character 1st Qu.: 0.0 1st Qu.: 0.0
## Median : 3748362 Mode :character Median : 0.0 Median : 0.1
## Mean :14382878 Mean : 0.1 Mean : 0.3
## 3rd Qu.: 3884023 3rd Qu.: 0.1 3rd Qu.: 0.3
## Max. :54275914 Max. :37.9 Max. :84.7
## relative_velocity miss_distance orbiting_body sentry_object
## Min. : 203 Min. : 6746 Length:90836 Length:90836
## 1st Qu.: 28619 1st Qu.:17210820 Class :character Class :character
## Median : 44190 Median :37846579 Mode :character Mode :character
## Mean : 48067 Mean :37066546
## 3rd Qu.: 62924 3rd Qu.:56548996
## Max. :236990 Max. :74798651
## absolute_magnitude hazardous
## Min. : 9.2 Length:90836
## 1st Qu.:21.3 Class :character
## Median :23.7 Mode :character
## Mean :23.5
## 3rd Qu.:25.7
## Max. :33.2
Distinct Values
Now, there are multiple asteroids with the same “id” and “name” in the dataset. This could happen due to multiple tests being performed on the asteroids or it could be another different reason. We shall remove these in order for the dataset to follow the rules of a relational dataset. We use the group_by function provided in R to do so and we take the mean of those asteroids with the same id and name so we don’t have any risk of losing the data.
The dimension of the hazardous dataframe after using only distinct asteroid ID’s => 2173, 5
The dimension of the non_hazardous dataframe after using only distinct asteroid ID’s => 25250, 5
Relative Velocity
Relative velocity of an NEO is defined as its velocity compared to the earth. It is measured in kilometers per second.
alt text here
| relative_velocity | |
|---|---|
| Min | Min. : 5908 |
| Q1 | 1st Qu.: 43018 |
| Median | Median : 58658 |
| Mean | Mean : 62794 |
| Q3 | 3rd Qu.: 78786 |
| Max | Max. :193387 |
| relative_velocity | |
|---|---|
| Min | Min. : 203 |
| Q1 | 1st Qu.: 27636 |
| Median | Median : 42566 |
| Mean | Mean : 46479 |
| Q3 | 3rd Qu.: 61062 |
| Max | Max. :236990 |
##
## Welch Two Sample t-test
##
## data: hazardous$relative_velocity and safe$relative_velocity
## t = 54, df = 10456, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 15724 16906
## sample estimates:
## mean of x mean of y
## 62794 46479
## [1] 46311 46647
## attr(,"conf.level")
## [1] 0.95
## [1] 46258 46700
## attr(,"conf.level")
## [1] 0.99
## [1] 62228 63361
## attr(,"conf.level")
## [1] 0.95
## [1] 62050 63539
## attr(,"conf.level")
## [1] 0.99
| Df | Sum Sq | Mean Sq | F value | Pr(>F) | |
|---|---|---|---|---|---|
| hazardous | 1 | 2.12e+12 | 2.12e+12 | 3446 | 0 |
| Residuals | 90834 | 5.60e+13 | 6.16e+08 | NA | NA |
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = relative_velocity ~ hazardous, data = neo)
##
## $hazardous
## diff lwr upr p adj
## True-False 16315 15770 16860 0
For basic descriptive statistics, measures of central tendency, mean and median, are all higher in the “hazardous” group compared to the objects deemed “safe.”
Using a 2-sample T-Test shows that the difference in means in two groups are significant at the 95% confidence level, as the P-value is smaller than .05.
After constructing confidence intervals at 95% and 99%, we see that the HIGHEST relative velocity in the “safe” condition is lower than the LOWEST relative velocity in the “hazardous” condition, giving more support to the correlation of high relative velocity associated with whether a NEO is hazardous.
## Outliers identified: 153
## Proportion (%) of outliers: 1.8
## Mean of the outliers: 145497
## Mean without removing outliers: 62794
## Mean if we remove outliers: 61338
## Outliers successfully removed
## Outliers identified: 1367
## Proportion (%) of outliers: 1.7
## Mean of the outliers: 126135
## Mean without removing outliers: 46479
## Mean if we remove outliers: 45129
## Outliers successfully removed
Relative velocity is one of the observable characteristics of the Kaggle dataset that other researchers in the field have looked at previously (Araujo 2014). Their research sought to identify the frequency of that asteroids would cross into potentially critical radii (“close encounters”) and, alongside impact parameter, relative velocity was one of the components the researchers examined. The researchers also used histograms to showcase the relative velocity component of their exploratory data analysis. With just a brief glance, the histograms clearly show that asteroids of the Atens groups suffered encounters with a higher relative velocity compared to the Amor group. Their overall findings were similar to ours, with asteroids and their satellites having a higher relative velocity representing the greatest probability of surviving the critical radii in the close encounter.
To estimate an asteroid’s destructive damage to Earth and her human-built infrastructure, we have baseline comparisons (as horrible as it is) in the area of high-yield nuclear weapons. To approximate asteroid impact energy as comparable to the yield of a high-explosive weapon in the megaton range, we would need to know the asteroid mass. Basic physics kinematics states that kinetic energy is a function of mass and velocity squared, KE= ½ MV^2. A one megaton explosion measures the explosive energy of one million tons of TNT, Trinitrotoluene, and is 4.18x10^15 Joules (Krehl 2008). The atomic bomb that dropped on Hiroshima was only 15 kilotons. Asteroids would yield easily in the megatons (Tonry 2010). Research on the former also requires mass to finalize the estimate of energy, and the scientists have a method to derive mass via density (usually taken as ∼2–3 g cm-3). They use the information to estimate the destructive capacity, as well as proposing early warning detection systems using what they know about asteroid destructive impact. Our work would equally benefit if we had asteroid mass (or a way to derive it).
Miss Distance
1. Descriptive statistics:
The miss distance, in kilometers relative to earth, was one of the
variables of interest. We believe that this was one of the key factors
that might influence the decision of whether an asteroid is hazardous or
not. First, we began by sub-setting the data into neo_tyler
with the variables of interest:
name,miss_distance and hazardous.
Further, we thought that it would seem appropriate to check whether the
data confirms the range, of distance relative to earth, that Potentially
Hazardous Asteroids PHAs have: “Potentially Hazardous Asteroids: NEAs
whose Minimum Orbit Intersection Distance (MOID) with the Earth is
0.05 au or less” (NASA) Therefore, by converting the
units from kilometers to astronomical units (au) and using the
summary() function, we are able to confirm that these are
indeed PHAs as defined by NASA. For the matters of this project, we
thought that expressing the variable miss_distance in
millions of kilometers relative to earth would be more comprehensible to
the audience.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.115 0.253 0.248 0.378 0.500
After properly sub-setting the data, we noticed from the excel file
that there are 27423 unique values. Thus, we thought that it would be
most opportune to take an average of the multiple
miss_distance values that appeared from the data set. To do
this average by unique values, we used the group_by()
function, which allowed us to simplify the data to our needs.
Finally, we further subsetted grouped data to hazardous
neo_h_tyler and not hazardous neo_nh_tyler. In
this manner, we can properly seek to answer our S.M.A.R.T. question in
hand. Starting first by observing the descriptive statistics with the
summary() function for each subset. As observed from these
figures, the range is roughly the same for both subsets, due to the NASA
definition of PHAs, and the mean of the hazardous data is slightly
higher than those not hazardous. Also, we thought that it would be
opportune to look at the variation inside of the subsetted data, by
using the sd() function, and as it turns out, hazardous
asteroids tend to be less variable than those that are not. Given that
there is a slight difference in the means and variability, we thought
that this would be the best moment visualize this output by means of a
boxplot and histograms.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.8 27.2 38.0 35.8 45.4 74.7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 12.7 32.2 30.0 43.9 74.8
## [1] 14.4
## [1] 18.4
2. Graphical interpretations of the data:
(Notation: h - for hazardous subset and nh- for not hazardous subset)
- Boxplots: As expected, the boxplots show that slight difference in means for h is slightly higher than for those nh. Additionally, the ranges are identical for both groups, however their frequencies might differ, so let’s take a look at the relative frequency histograms.
- Relative Frequency Histograms: As for the Relative Frequency Histograms, there are both similitudes and differences in terms of relative frequency for both the h and nh subset. The h data appears to follow a normal distribution, with the majority of the frequency revolving around the 40 million kilometer mark, however there are higher frequencies in the 0-20 million kilometer mark. Similarly, the nh presents the same features as the h data, however the values in the 0-20 million kilometer mark are more exaggerated. Therefore, since the distribution for h looks closely to a normal distribution, we thought that this would be the most appropriate moment to run normality tests.
3. Normality Tests:
- QQ-plots: As expected from the relative frequency histograms, the
QQ-plot for h closely follows the QQ-line (meaning that is follows a
normal distribution), however the values clearly depart in that 0-20
million kilometer range. On the other hand, the nh data is clearly under
dispersed, visualized by the s-shape in the QQ-plot, meaning that there
is excess negative kurtosis in the subsetted data. Now, that we have
looked at the visual component of these subsets, we thought it would be
best to revisit the difference in the means to check if h and nh have
the same sampling distribution for when it comes to
miss_distance.
4. T-test/Intervals
Two-tailed t-tests:
- @ 95% confidence level: The lower confidence level for h (which recall had the highest mean between h and nh) was of 35.2 millions of kilometers, and as for the highest confidence level for nh was of 30.2 millions of kilometer. Since there is a 5 million kilometer difference in the means, we thought that it would be best to check at a higher confidence level.
##
## One Sample t-test
##
## data: neo_h_tyler$miss_distance
## t = 116, df = 2172, p-value <2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 35.2 36.5
## sample estimates:
## mean of x
## 35.8
##
## One Sample t-test
##
## data: neo_nh_tyler$miss_distance
## t = 259, df = 25249, p-value <2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 29.8 30.2
## sample estimates:
## mean of x
## 30
-@ 99% confidence level: The lower confidence level for h was of 35.1 millions of kilometers, and as for the highest confidence level for nh was of 30.3 millions of kilometer. This 4.8 million kilometer difference indicates that these means might not come from the same sampling distribution. Thus, we further explored this differences with an ANOVA test.
##
## One Sample t-test
##
## data: neo_h_tyler$miss_distance
## t = 116, df = 2172, p-value <2e-16
## alternative hypothesis: true mean is not equal to 0
## 99 percent confidence interval:
## 35.1 36.6
## sample estimates:
## mean of x
## 35.8
##
## One Sample t-test
##
## data: neo_nh_tyler$miss_distance
## t = 259, df = 25249, p-value <2e-16
## alternative hypothesis: true mean is not equal to 0
## 99 percent confidence interval:
## 29.7 30.3
## sample estimates:
## mean of x
## 30
5. ANOVA & Post-Hoc Tukey:
ANOVA: The ANOVA test yielded highly significant results. As indicated by the extremely small p-value of 2e-16, these means do not come from the same sampling distribution, and therefore are different in their nature. So, to get a more accurate reading on these differences we ran a Post-Hoc Tukey test to know by how much precisely do these means differ.
## Df Sum Sq Mean Sq F value Pr(>F)
## grouped_data$hazardous 1 68168 68168 207 <2e-16 ***
## Residuals 27421 9030616 329
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Post-Hoc Tukey: In the Post-Hoc Tukey test we were able to identify that the means differ by roughly 5 million kilometers to 7 million kilometers.
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = grouped_data$miss_distance ~ grouped_data$hazardous, data = neo_tyler)
##
## $`grouped_data$hazardous`
## diff lwr upr p adj
## True-False 5.84 5.04 6.63 0
6. Conclusion:
-Hazardous NEOs, on average, are slightly further away from Earth (roughly by 5-7 M kilometers) than those not hazardous.
-Hazardous NEOs vary less in their distance relative to Earth (σ < σ) than those not hazardous.
-Hazardous NEOs do not share the same sampling distribution of distance relative to Earth as those not hazardous.
Absolute Magnitude
Absolute magnitude is the third variable that we will be go over in detail. According to Wikipedia, it is defined as a measure of magnitude (brightness) of a celestial object as it would be seen at a standard distance of 10 parsecs. Absolute magnitude plays a very important role in determining if the NEO is hazardous or not. This is because the inverse square law states that the brightness of a light source decreases with the square of the size of that asteroid. Hence, this would indicate that absolute magnitude variable must have a certain relationship with the diameter variables of the asteroid (est_diameter_min and est_diameter_max). Also, from the official NASA documentation, it is worth pointing out that the absolute magnitude of the PHAs is always less than 22.0. Throughout the rest of the documentation, we shall go over plots and perform tests to prove whether the hazardous asteroid’s absolute magnitude lies below 22.0 H or no.
Before that, we shall explore some key characteristics of absolute_magnitude in detail:
Summary of absolute_magintude of hazardous asteroid:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 14.0 19.6 20.6 20.3 21.4 22.4
Summary of absolute_magnitude of non-hazardous NEO:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.2 22.1 24.1 23.9 25.9 33.2
Let’s find the mode
Mode of absolute_magnitude in neo dataset => 24.4
Mode of absolute_magnitude in hazardous dataset => 21.9
Mode of absolute_magnitude in non-hazardous dataset => 24.4
Measures of Variance and Standard Deviation
Variation of absolute_magnitude of hazardous NEO => 1.8
Variation of absolute_magnitude of non_hazardous NEO => 7.847
Standard deviation of absolute_magnitude of hazardous NEO => 1.342
Standard deviation of absolute_magnitude of non_hazardous NEO => 2.801
Graphical understanding using Plots
Note - We see that the absolute_magnitude for hazardous NEO lies mostly between 19 to 22. While that of the non-hazardous asteroids gnerally lies between 23 to 27.
This proves what was mentioned in NASA documentation that Potentially Hazardous Asteroids have an absolute magnitude of 22.0 H or less.
Outliers
Knowing if the dataset has outliers is a good way to know if the
dataset has any “bad data”. And it is essential to know or keep track of
the number of records that represent outliers. To keep track of the
outliers in the dataset, we shall use outlierKD2 function:
## Outliers identified: 231
## Proportion (%) of outliers: 2.7
## Mean of the outliers: 16.1
## Mean without removing outliers: 20.3
## Mean if we remove outliers: 20.4
## Outliers successfully removed
The dimension of the hazardous dataframe after removing outliers: 8840, 10
## Outliers identified: 446
## Proportion (%) of outliers: 0.5
## Mean of the outliers: 18.9
## Mean without removing outliers: 23.9
## Mean if we remove outliers: 23.9
## Outliers successfully removed
The dimension of the non_hazardous dataframe after removing outliers: 81996, 10
T-Tests
Now, with all outliers removed and distinct values extracted, we can perform t-tests.
T-Tests on hazardous dataframe:
##
## One Sample t-test
##
## data: hazardous_grouped$absolute_magnitude
## t = 670, df = 2172, p-value <2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 20.1 20.3
## sample estimates:
## mean of x
## 20.2
T-Tests on non_hazardous dataframe:
##
## One Sample t-test
##
## data: non_hazardous_grouped$absolute_magnitude
## t = 1341, df = 25249, p-value <2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 23.6 23.7
## sample estimates:
## mean of x
## 23.6
Contingency Table
Now, we shall validate the above by printing the Contingency table. The contingency table of hazardous is below:
##
## (16,16.5] (16.5,17] (17,17.5] (17.5,18] (18,18.5] (18.5,19] (19,19.5] (19.5,20]
## 21 23 39 58 107 134 191 240
## (20,20.5] (20.5,21] (21,21.5] (21.5,22] (22,22.5] (22.5,23]
## 296 302 352 368 17 0
We see a higher frequency of values falling in the interval 21-22 for hazardous asteroids. However, as noticed in the plots drawn, the plot is left skewed. Hence, the average/mean shifted a bit towards the left.
Here too, we can confirm that the PHAs in our dataset follow the official NASA documentation.
The contingency table of non_hazardous is below:
##
## (16,16.5] (16.5,17] (17,17.5] (17.5,18] (18,18.5] (18.5,19] (19,19.5] (19.5,20]
## 76 92 161 233 401 527 706 795
## (20,20.5] (20.5,21] (21,21.5] (21.5,22] (22,22.5] (22.5,23] (23,23.5] (23.5,24]
## 913 1015 933 933 1349 1397 1600 1744
## (24,24.5] (24.5,25] (25,25.5] (25.5,26] (26,26.5] (26.5,27] (27,27.5] (27.5,28]
## 1905 1908 1830 1556 1430 1162 854 625
## (28,28.5] (28.5,29] (29,29.5] (29.5,30] (30,30.5] (30.5,31] (31,31.5] (31.5,32]
## 404 247 160 98 50 22 11 10
We see a higher frequency of values falling in the interval 24-25 for non_hazardous.
Anova Test:
For anova test, we need 2 samples of equal size - one from hazardous and another from non-hazardous
## Df Sum Sq Mean Sq F value Pr(>F)
## hazardous 1 1.60e+08 1.6e+08 80793733 <2e-16 ***
## Residuals 2172 4.29e+03 2.0e+00
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Observation - the p-value is a small number (<0.001) which indicates that we can reject the null hypothesis and confirm that the mean for hazardous and non-hazardous NEOs is not the same. Note that the p-value is same from results obtained in anova test and t-test.
Here, we finally prove that our dataset has the correct definition of PHAs and that they all have an absolute magnitude less than 22 H.
Correlation with Diameter
A correlation plot is used to determine how close each variable is to
the other. We know from earlier that the absolute magnitude is inversely
related to the size of the NEO. We shall draw a correlation plot with
est_diameter_min and est_diameter_max to verify and see if it does hold
significance or not?
As noted in the above correlation plot, we see that the the min and max estimated diameter is closely related to the absolute magnitude. It is inversely related. This makes sense because absolute magnitude (brightness of a NEO) indicates the size of the asteroid in question. And hence, we can say that the bigger the size of the asteroid, lesser the absolute magnitude (brighter the star), and thus potentially more hazardous is the asteroid.
Overall Correlation Plots:
After exploring each of these 3 variables, we shall now see their correlation with that of our target variable - hazardous. We shall draw both - Pearson and Spearman correlation plots.
Here, we see that hazardous variable has a higher correlation with absolute magnitude, and the diameter variables. This means that the size and brightness of the asteroid plays a very significant role in deciding if the asteroid is hazardous or not. These variables are followed by relative velocity and miss distance.
PROJECT 2
EDA ENDS
Splitting and Sampling
set.seed(123)
non_hazardous = non_hazardous[sample(nrow(non_hazardous),nrow(hazardous)),]
nasa_model_data= rbind(hazardous, non_hazardous)
nrow(nasa_model_data)## [1] 17680
## 75% of the sample size
smp_size <- floor(0.75 * nrow(nasa_model_data))
## set the seed to make your partition reproducible
set.seed(123)
train_data_size <- sample(seq_len(nrow(nasa_model_data)), size = smp_size)
nasa_train.labels = nasa_model_data[train_data_size,c("hazardous")]
nasa_test.labels = nasa_model_data[-train_data_size,c("hazardous")]
nasa_model_data = nasa_model_data %>% select(est_diameter_min, est_diameter_max, relative_velocity, miss_distance, absolute_magnitude)
nasa_train <- nasa_model_data[train_data_size,]
nasa_test <- nasa_model_data[-train_data_size,]Comment about why we are doing normalisation
Normalisation
nasa_train = scale(nasa_train, center = TRUE, scale = TRUE)
nasa_test = scale(nasa_test, center = TRUE, scale = TRUE)KNN
library(class)
loadPkg("gmodels")
loadPkg("gmodels")
loadPkg("FNN")
loadPkg("caret")
knn.5 <- knn(train=nasa_train, test=nasa_test, cl=nasa_train.labels, k=5)
ACC.5=100 * sum(nasa_test.labels == knn.5)/NROW(nasa_test.labels)
paste0("Accuracy = ", ACC.5)## [1] "Accuracy = 85.0904977375566"
print(confusionMatrix(knn.5, as.factor(nasa_test.labels) ))## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1701 178
## 1 481 2060
##
## Accuracy : 0.851
## 95% CI : (0.84, 0.861)
## No Information Rate : 0.506
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.701
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.780
## Specificity : 0.920
## Pos Pred Value : 0.905
## Neg Pred Value : 0.811
## Prevalence : 0.494
## Detection Rate : 0.385
## Detection Prevalence : 0.425
## Balanced Accuracy : 0.850
##
## 'Positive' Class : 0
##
ResultDf = data.frame( k=numeric(0), Total.Accuracy= numeric(0), row.names = NULL )
for (kval in seq(11,50,2)) {
nasa_pred <- knn(train = nasa_train, test = nasa_test, cl=nasa_train.labels, k=kval)
#NASA_PREDCross = CrossTable(nasa_test.labels, nasa_pred, prop.chisq = FALSE)
cm = confusionMatrix(nasa_pred, as.factor(nasa_test.labels) ) # from caret library
cmaccu = cm$overall['Accuracy']
cmt = data.frame(k=kval, Total.Accuracy = cmaccu, row.names = NULL ) # initialize a row of the metrics
ResultDf = rbind(ResultDf, cmt*100)
}
xkabledply(ResultDf, "Total Accuracy Summary:")| k | Total.Accuracy |
|---|---|
| 1100 | 86.3 |
| 1300 | 86.8 |
| 1500 | 86.8 |
| 1700 | 86.7 |
| 1900 | 86.7 |
| 2100 | 87.0 |
| 2300 | 87.0 |
| 2500 | 87.1 |
| 2700 | 87.2 |
| 2900 | 87.2 |
| 3100 | 87.1 |
| 3300 | 87.2 |
| 3500 | 87.1 |
| 3700 | 87.1 |
| 3900 | 87.1 |
| 4100 | 87.1 |
| 4300 | 87.0 |
| 4500 | 86.9 |
| 4700 | 86.8 |
| 4900 | 86.9 |
ggplot(data=ResultDf, aes(x=k, y=Total.Accuracy,group=1)) +
geom_line(color="#aa0022", size=1.75) +
geom_point(color="#aa0022", size=3.5) +
ggtitle("K value against their accuracy in KNN model") +
labs(x="K value", y="Accuracy of K value") +
theme(axis.title.y = element_text(size=10, family="Trebuchet MS", color="#666666")) +
theme(axis.text = element_text(size=14, family="Trebuchet MS")) +
theme(plot.title = element_text(size=20, family="Trebuchet MS", face="bold", hjust=0, color="#666666"))
From the above, we see that when k=32, we get the highest accuracy of
87.2%. Also, noticed a higher sensitivity in the model. Why? IDK.
SVM
# install.packages("e1071")
library("e1071")
classifier = svm(formula = nasa_train.labels ~ .,
data = nasa_train,
type = 'C-classification',
kernel = 'linear',cost = 10)
y_pred = predict(classifier, newdata = nasa_test)
cm = confusionMatrix(y_pred, as.factor(nasa_test.labels) )
cm## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1612 48
## 1 570 2190
##
## Accuracy : 0.86
## 95% CI : (0.85, 0.87)
## No Information Rate : 0.506
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.719
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.739
## Specificity : 0.979
## Pos Pred Value : 0.971
## Neg Pred Value : 0.793
## Prevalence : 0.494
## Detection Rate : 0.365
## Detection Prevalence : 0.376
## Balanced Accuracy : 0.859
##
## 'Positive' Class : 0
##
Random Forest
nasa_model_data_randomforest= rbind(hazardous, non_hazardous)
nasa_model_data_randomforest$hazardous = as.factor(nasa_model_data_randomforest$hazardous)
## 75% of the sample size
smp_size <- floor(0.75 * nrow(nasa_model_data_randomforest))
## set the seed to make your partition reproducible
set.seed(123)
train_data_size <- sample(seq_len(nrow(nasa_model_data)), size = smp_size)
nasa_train_randomforest <- nasa_model_data_randomforest[train_data_size,]
nasa_test_randomforest <- nasa_model_data_randomforest[-train_data_size,]
library(randomForest)
library("pROC")
classifier <- randomForest( formula = hazardous~., data=nasa_train_randomforest)
print(classifier)##
## Call:
## randomForest(formula = hazardous ~ ., data = nasa_train_randomforest)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 9.69%
## Confusion matrix:
## 0 1 class.error
## 0 5541 1117 0.1678
## 1 168 6434 0.0254
pred_1 = predict(classifier, nasa_test_randomforest)
paste0("Accuracy = ",mean(nasa_test_randomforest$hazardous == pred_1))## [1] "Accuracy = 0.880769230769231"
nasa_test_randomforest$prob = pred_1
nasa_test_randomforest$hazardous = as.numeric(nasa_test_randomforest$hazardous)
nasa_test_randomforest$prob = as.numeric(nasa_test_randomforest$prob)
h <- roc(hazardous~prob, data=nasa_test_randomforest)
print(h)##
## Call:
## roc.formula(formula = hazardous ~ prob, data = nasa_test_randomforest)
##
## Data: prob in 2182 controls (hazardous 1) < 2238 cases (hazardous 2).
## Area under the curve: 0.88
plot(h,main="ROC Curve for Random Forest",col=2,lwd=2)varImpPlot(classifier)
Classifier estimate of error = 9.7%. Which means that accuracy is 90.3%.
Area under curve is 0.908. Very good fit model.
Limitations
The neo.csv data set posed limitations in the variables,
description and dictionary. Since the dataset was obtained from a
web-based environment, kaggle, we had insufficient variables, a poor
description of the dataset and no data dictionary provided. In terms of
the variables provided, it would have been ideal to have one regarding
the trajectory towards, or away, from earth, as well as other
potentially hazardous, or not hazardous, variables. Nevertheless, we
were restricted to just using the three variables that were quantifiable
in order to characterize the potential of it being hazardous. In
retrospect, just using these three variables might have been an
over-simplification in our determination of a hazard. On the other hand,
the description of the dataset is very limited given the constraint of
using Kaggle. It would have been better if the dataset was obtained
straight from a NASA website or API, however this was not possible. It
would have been a huge perk to know how these variables are being used
by NASA professionals, what characterizes an instance to be recorded
multiple times, and why they initially attributed the boolean value for
Hazardous. Finally, when it came to providing a data dictionary, the
kaggle website did not have one. Thus, we had to do our own research as
to what these variables meant, and what they were describing. Therefore,
with insufficient variables, a poor description and no data dictionary
it was difficult to infer, or fully understand, the root of a hazardous
or not hazardous asteroid to fully answer our question.
References:
- Kaggle Dataset (Kaggle): Vani, S. (2022, June 17). NASA - Nearest Earth objects. Kaggle. Retrieved November 2, 2022, from https://www.kaggle.com/datasets/sameepvani/nasa-nearest-earth-objects
- Center for Near Earth Objects Studies (CNEOS): NASA. (n.d.). Neo basics. NASA. Retrieved November 2, 2022, from https://cneos.jpl.nasa.gov/about/neo_groups.html
- NASA (NASA): Bardan, R. (2022, October 11). NASA confirms Dart Mission Impact Changed Asteroid’s motion in space. NASA. Retrieved November 2, 2022, from https://www.nasa.gov/press-release/nasa-confirms-dart-mission-impact-changed-asteroid-s-motion-in-space
- Araujo, R. A. N., & Winter, O. C. (2014). Near-Earth asteroid binaries in close encounters with the Earth. Astronomy & Astrophysics, 566, A23.
- Peter O. K. Krehl (2008). History of Shock Waves, Explosions and Impact: A Chronological and Biographical Reference. Springer Science & Business Media. p. 404. ISBN 978-3-540-30421-0.
- Tonry, J. L. (2010). An early warning system for asteroid impact. Publications of the Astronomical Society of the Pacific, 123(899), 58.